123 research outputs found

    Identification of SNP interactions using logic regression

    Interactions of single nucleotide polymorphisms (SNPs) are assumed to be responsible for complex diseases such as sporadic breast cancer. Important goals of studies concerned with such genetic data are thus to identify combinations of SNPs that lead to a higher risk of developing a disease and to measure the importance of these interactions. There are many approaches based on classification methods such as CART and Random Forests that allow measuring the importance of single variables. None of these methods, however, can directly quantify the importance of combinations of variables. In this paper, we show how logic regression can be employed to identify SNP interactions explanatory for the disease status in a case-control study and propose two measures for quantifying the importance of these interactions for classification. These approaches are then applied, on the one hand, to simulated data sets, and on the other hand, to the SNP data of the GENICA study, a study dedicated to the identification of genetic and gene-environment interactions associated with sporadic breast cancer. --Single Nucleotide Polymorphism,Feature Selection,Variable Importance Measure,GENICA

    Similarity Measures for Clustering SNP Data

    The issue of suitable similarity measures for a particular kind of genetic data – so-called SNP data – arises from the GENICA (Interdisciplinary Study Group on Gene Environment Interaction and Breast Cancer in Germany) case-control study of sporadic breast cancer. The GENICA study aims to investigate the influence and interaction of single nucleotide polymorphic (SNP) loci and exogenous risk factors. A single nucleotide polymorphism is a point mutation that is present in at least 1 % of a population. SNPs are the most common form of human genetic variation. In particular, we consider 65 SNP loci and 2 insertions of longer sequences in genes involved in the metabolism of hormones, xenobiotics and drugs as well as in the repair of DNA and signal transduction. Assuming that these single nucleotide changes may lead, for instance, to altered enzymes or to a reduced or enhanced amount of the original enzymes – with each alteration alone having minor effects – we aim to detect combinations of SNPs that under certain environmental conditions increase the risk of sporadic breast cancer. The search for patterns in the present data set may be performed by a variety of clustering and classification approaches. We consider here the problem of suitable measures of proximity of two variables or subjects as an indispensable basis for a further cluster analysis. Generally, clustering approaches are a useful tool to detect structures and to generate hypotheses about potential relationships in complex data situations. When searching for patterns in the data, there are two possible objectives: the identification of groups of similar objects or subjects, or the identification of groups of similar variables within the whole population or within subpopulations. Comparing the individual genetic profiles as well as the genetic information across subpopulations, we discuss possible choices of similarity measures, in particular similarity measures based on the counts of matches and mismatches. New matching coefficients are introduced with a more flexible weighting scheme to account for the general problem of the comparison of SNP data: the large proportion of homozygous reference sequences relative to the homo- and heterozygous SNPs masks the accordances and differences of interest. --GENICA,single nucleotide polymorphism (SNP),sporadic breast cancer,similarity,Matching Coefficient,Flexible Matching Coefficient
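
    The masking effect described in this abstract can be illustrated with a small sketch. It assumes genotypes coded as 0 (homozygous reference), 1 (heterozygous) and 2 (homozygous variant). The function names and the weights `w_ref`, `w_var` and `w_mis` are hypothetical and chosen for illustration only; this is not the Flexible Matching Coefficient defined in the paper.

```python
def simple_matching(a, b):
    """Share of loci at which two genotype profiles agree."""
    if len(a) != len(b) or not a:
        raise ValueError("profiles must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def weighted_matching(a, b, w_ref=0.0, w_var=1.0, w_mis=1.0):
    """Matching coefficient with separate weights (illustrative assumption).

    w_ref: weight of matches on the homozygous reference genotype (0/0)
    w_var: weight of matches on a variant genotype (1/1 or 2/2)
    w_mis: weight of mismatches
    """
    if len(a) != len(b) or not a:
        raise ValueError("profiles must be non-empty and of equal length")
    ref = sum(x == y == 0 for x, y in zip(a, b))
    var = sum(x == y and x != 0 for x, y in zip(a, b))
    mis = sum(x != y for x, y in zip(a, b))
    num = w_ref * ref + w_var * var
    den = num + w_mis * mis
    return num / den if den else 0.0


# Two subjects who share mostly reference genotypes.
subject_a = [0, 0, 0, 0, 1, 2, 0, 0]
subject_b = [0, 0, 0, 0, 1, 0, 0, 0]

plain = simple_matching(subject_a, subject_b)       # dominated by 0/0 matches
no_ref = weighted_matching(subject_a, subject_b)    # 0/0 matches ignored
```

    With all three weights set to 1 the weighted version reduces to the simple matching coefficient, so the weights only control how strongly shared reference genotypes count.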

    Imputing missing genotypes with weighted k nearest neighbors

    Motivation: Missing values are a common problem in genetic association studies concerned with single nucleotide polymorphisms (SNPs). Since most statistical methods cannot handle missing values, they have to be removed prior to the actual analysis. Considering only complete observations, however, often leads to an immense loss of information. Therefore, procedures are needed that can be used to replace such missing values. In this article, we propose a method based on weighted k nearest neighbors that can be employed for imputing such missing genotypes. Results: In a comparison to other imputation approaches, our procedure called KNNcatImpute shows the lowest rates of falsely imputed genotypes when applied to the SNP data from the GENICA study, a study dedicated to the identification of genetic and gene-environment interactions associated with sporadic breast cancer. Moreover, in contrast to other imputation methods that take all variables into account when replacing missing values of a particular variable, KNNcatImpute is not restricted to association studies comprising several tens to a few hundred SNPs, but can also be applied to data from whole-genome studies, as an application to a subset of the HapMap data shows.
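
    The general idea of imputing a categorical genotype by a weighted vote of the k nearest neighbors can be sketched as follows. The abstract does not specify the distance or the weights used by KNNcatImpute, so the mismatch-proportion distance, the inverse-distance weights, the neighbor search over subjects, and the names `mismatch_distance` and `impute_genotype` are all assumptions made for this illustration.

```python
from collections import defaultdict


def mismatch_distance(u, v, skip):
    """Proportion of mismatches over loci observed in both profiles.

    The locus at index `skip` (the one being imputed) is left out.
    Returns None when the two profiles share no observed locus.
    """
    shared = [(x, y) for j, (x, y) in enumerate(zip(u, v))
              if j != skip and x is not None and y is not None]
    if not shared:
        return None
    return sum(x != y for x, y in shared) / len(shared)


def impute_genotype(data, row, col, k=3, eps=1e-6):
    """Weighted vote of the k nearest rows that have data[.][col] observed."""
    candidates = []
    for i, other in enumerate(data):
        if i == row or other[col] is None:
            continue
        d = mismatch_distance(data[row], other, col)
        if d is not None:
            candidates.append((d, i))
    if not candidates:
        raise ValueError("no neighbor with an observed value at this locus")
    candidates.sort()
    votes = defaultdict(float)
    for d, i in candidates[:k]:
        votes[data[i][col]] += 1.0 / (d + eps)  # closer neighbors weigh more
    return max(votes, key=votes.get)


# Rows are subjects, columns are SNPs; None marks a missing genotype.
data = [
    [0, 1, 2, None],
    [0, 1, 2, 1],
    [0, 1, 1, 1],
    [2, 0, 0, 0],
    [2, 2, 0, 0],
]
imputed = impute_genotype(data, row=0, col=3, k=3)
```

    Because the vote uses only the k closest profiles, the cost per missing value does not depend on fitting a model over all variables at once.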

    Cluster Analysis : A Comparison of Different Similarity Measures for SNP Data

    The issue of suitable similarity measures for a particular kind of genetic data - so-called SNP data - arises, e.g., from the GENICA (The Interdisciplinary Study Group on Gene Environmental Interactions and Breast Cancer in Germany) case-control study of sporadic breast cancer. The GENICA study aims to investigate the influence and interaction of single nucleotide polymorphic (SNP) loci and exogenous risk factors. It is very unlikely that there exists one main effect, say only one polymorphism, being responsible for such a complex disease as sporadic breast cancer, as the role of a single gene within the carcinogenic process is limited (Garte, 2001). Nevertheless, it is assumed that a number of interacting SNPs in combination with certain environmental risk factors increase the individual susceptibility. The search for SNP patterns in the present data set may be performed by a variety of clustering and classification approaches. Here we consider the problem of adequate similarity measures for variables or subjects as an indispensable basis for a further cluster analysis. The term "similarity" is still vague for SNP data. A main problem arises from the general structure of such data sets: the proportion of hetero- or homozygous SNPs is rather small compared with the homozygous reference sequence. Thus, the relevant information of combinations of genetic alterations is often masked by a huge amount of common occurrences of homozygous reference types. Therefore, we examine different similarity measures, conventional ones as well as new coefficients which we created especially for SNP data. Furthermore, we compare the resulting partitions with each other, adapting the clustering of clustering methods of Rand (1971) for different similarity measures. --cluster analysis,clustering clustering methods,GENICA,similarity,single nucleotide polymorphism,sporadic breast cancer
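
    Comparing two partitions of the same subjects is commonly done with the Rand index, which counts the pairs of objects that both partitions treat in the same way. The sketch below shows only this basic index; how the paper adapts Rand's approach to compare similarity measures is not stated in the abstract, and the function name `rand_index` is chosen for illustration.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which two partitions agree.

    A pair counts as an agreement when both partitions put the two
    objects in the same cluster, or both put them in different clusters.
    """
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two labelings of the same >= 2 objects")
    agree = 0
    total = 0
    for i, j in combinations(range(n), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total


# Cluster labels obtained from two (hypothetical) similarity measures.
partition_1 = [0, 0, 1, 1, 2, 2]
partition_2 = [0, 0, 1, 1, 1, 1]
agreement = rand_index(partition_1, partition_2)  # 11 of 15 pairs agree
```

    The index depends only on co-membership, so renaming the cluster labels of either partition leaves it unchanged.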

    Comparison of the empirical bayes and the significance analysis of microarrays

    Microarrays make it possible to measure the expression levels of tens of thousands of genes simultaneously. One important statistical question in such experiments is which of the several thousand genes are differentially expressed. Answering this question requires methods that can deal with multiple testing problems. One such approach is the control of the False Discovery Rate (FDR). Two recently developed methods for the identification of differentially expressed genes and the estimation of the FDR are the SAM (Significance Analysis of Microarrays) procedure and an empirical Bayes approach. In the two-group case, both methods are based on a modified version of the standard t-statistic. However, it is also possible to use the Wilcoxon rank sum statistic. While there already exists a version of the empirical Bayes approach based on this rank statistic, we introduce in this paper a new version of SAM based on Wilcoxon rank sums. We furthermore compare these four procedures by applying them to simulated and real gene expression data. --Identification of differentially expressed genes,Gene expression,Multiple Testing,False Discovery Rate
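
    For a single gene, the Wilcoxon rank sum statistic mentioned in this abstract can be computed as in the sketch below, using midranks for tied values. This shows only the per-gene statistic; the SAM and empirical Bayes steps that turn such statistics into FDR estimates are not reproduced here, and the function name `rank_sum` and the example values are illustrative assumptions.

```python
def rank_sum(group1, group2):
    """Wilcoxon rank sum of group1 within the pooled sample (midranks for ties)."""
    pooled = sorted(list(group1) + list(group2))
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Tied values occupy ranks i+1 .. j; assign them the average rank.
        ranks[pooled[i]] = (i + 1 + j) / 2
        i = j
    return sum(ranks[v] for v in group1)


# Made-up expression values of one gene in two groups of three samples.
treated = [5.1, 6.3, 7.0]
control = [1.2, 2.4, 5.1]
w = rank_sum(treated, control)

# Expected rank sum of the first group if both groups were exchangeable.
n1, n2 = len(treated), len(control)
w_expected = n1 * (n1 + n2 + 1) / 2
```

    A rank sum far from its expected value suggests differential expression, and because it uses only ranks it is insensitive to outlying intensities.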

    Random projections for Bayesian regression

    This article deals with random projections applied as a data reduction technique for Bayesian regression analysis. We show sufficient conditions under which the entire $d$-dimensional distribution is approximately preserved under random projections by reducing the number of data points from $n$ to $k\in O(\operatorname{poly}(d/\varepsilon))$ in the case $n\gg d$. Under mild assumptions, we prove that evaluating a Gaussian likelihood function based on the projected data instead of the original data yields a $(1+O(\varepsilon))$-approximation in terms of the $\ell_2$ Wasserstein distance. Our main result shows that the posterior distribution of Bayesian linear regression is approximated up to a small error depending on only an $\varepsilon$-fraction of its defining parameters. This holds when using arbitrary Gaussian priors or the degenerate case of uniform distributions over $\mathbb{R}^d$ for $\beta$. Our empirical evaluations involve different simulated settings of Bayesian linear regression. Our experiments underline that the proposed method is able to recover the regression model up to small error while considerably reducing the total running time.
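
    The data reduction idea can be illustrated with a toy sketch in which a dense Gaussian matrix projects $n$ observations down to $k$ rows before a one-parameter regression is fitted. The Gaussian projection, the sizes `n` and `k`, the simulated data and the function names are assumptions made for this illustration; the abstract does not say which projection the paper uses. The sketch fits a least squares slope, which coincides with the posterior mean of Gaussian linear regression only under a flat prior, so it does not reproduce the paper's Bayesian analysis.

```python
import random


def gaussian_sketch(columns, k, rng):
    """Project each length-n column to length k with one shared Gaussian matrix.

    Entries are N(0, 1/k), so squared norms are preserved in expectation.
    """
    n = len(columns[0])
    scale = 1.0 / k ** 0.5
    proj = [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(k)]
    return [[sum(p * c for p, c in zip(row, col)) for row in proj]
            for col in columns]


def slope(x, y):
    """Least squares slope of y on x for a model without intercept."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


rng = random.Random(0)
n, k = 500, 50  # illustrative sizes with n much larger than k

# Simulated data with true slope 2 and small Gaussian noise.
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [2.0 * xi + rng.gauss(0.0, 0.01) for xi in x]

# The same projection must be applied to predictors and response.
sx, sy = gaussian_sketch([x, y], k, rng)

full_fit = slope(x, y)        # uses all n data points
sketched_fit = slope(sx, sy)  # uses only k projected points
```

    In this toy setting the projected fit uses a tenth of the rows, and both fits stay close to the true slope of 2.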